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THE INVERSE FAST MULTIPOLE METHOD: USING A FAST 
APPROXIMATE DIRECT SOLVER AS A PRECONDITIONER FOR 
DENSE LINEAR SYSTEMS 

PIETER COULIERtt, HADI POURANSARlt, AND ERIC DARVEt 

Abstract. Although some preconditioners are available for solving dense linear systems, there 
are still many matrices for which preconditioners are lacking, in particular in cases where the size 
of the matrix N becomes very large. Examples of preconditioners include ILU preconditioners that 
sparsify the matrix based on some threshold, algebraic multigrid, and specialized preconditioners, 
e.g., Calderon and other analytical approximation methods when available. Despite these methods, 
there remains a great need to develop general purpose preconditioners whose cost scales well with 
the matrix size N. In this paper, we propose a preconditioner, with broad applicability and with cost 
0{N), for dense matrices, when the matrix is given by a smooth kernel. Extending the method using 
the same framework to general H^-matrices (i.e., algebraic instead of defined in terms of an analytical 
kernel) is relatively straightforward, but this won’t be discussed here. These preconditioners have 
a controlled accuracy (e.g., machine accuracy can be achieved if needed) and scale linearly with N. 
They are based on an approximate direct solve of the system. The linear scaling of the algorithm is 
achieved by means of two key ideas. First, the 'H^-structure of the dense matrix is exploited to obtain 
an extended sparse system of equations. Second, fill-ins arising when performing the elimination of 
the latter are compressed as low-rank matrices if they correspond to well-separated interactions. 
This ensures that the sparsity pattern of the extended sparse matrix is preserved throughout the 
elimination, hence resulting in a very efficient algorithm with 0{N log(l/£’)^) computational cost and 
0{N logl/e) memory requirement, for an error tolerance 0 < £ < 1. The solver is inexact, although 
the error can be controlled and made as small as needed. These solvers are related to ILU in the sense 
that the fill-in is controlled. However, in ILU, most of the fill-in (i.e., below a certain tolerance) is 
simply discarded, whereas here it is approximated using low-rank blocks, with a prescribed tolerance. 
Numerical examples are discussed to demonstrate the linear scaling of the method and to illustrate 
its effectiveness as a preconditioner. 

Key words. Fast direct solver, preconditioner, "H^-matrices, extended sparsification, low-rank 
compression. 
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1. Introduction. Large linear systems of equations Ax = b involving dense ma¬ 
trices A arise in a variety of applications in science and engineering (e.g., in bound¬ 
ary element methods [6, 35], Kalman filtering [37], radial basis function interpola¬ 
tion [9, 24], etc.). Storing a dense matrix entrywise requires a quadratic amount of 
memory 0{N‘^) (where N indicates the matrix size), while solving the corresponding 
system of equations with a direct solver involves 0{N^) flops. In order to alleviate 
these stringent requirements, fast methods have been developed in the past decades, 
including the fast multipole method (FMM) [17, 22, 42, 47, 57], the panel clustering 
technique [30, 52], and methods based on hierarchical matrices [4, 8, 25, 27, 28]. These 
methods rely on the observation that many dense matrices can be represented in a hi¬ 
erarchical format, i.e., the original matrix can be subdivided into a hierarchy of smaller 
block matrices and most of these blocks are replaceable by low-rank approximations. 
Several categories of hierarchical matrix representations can be distinguished. In the 
simplest case, all off-diagonal blocks are assumed to be low-rank, which is referred 
to as a weak admissibility criterion [29]. If the matrix entries can be interpreted as 
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interactions between points in a physical domain, this criterion implies that all the 
interactions of a certain cluster of points with any other cluster (except itself) are 
low-rank. This assumption leads to hierarchically off-diagonal low-rank (HODLR) 
matrices [3] and hierarchically semi-separable (HSS) matrices [12, 53]. In the latter, 
a nested approach is used (i.e., the low-rank basis at a certain hierarchical level is 
constructed using the low-rank basis at the child level), while this is not the case in 
the former. If a more general strong admissibility criterion is adopted, %- and TH?- 
matrices are obtained for the non-nested and nested cases, respectively [4, 8]^. In these 
matrices, only interactions between non-neighboring clusters of points are assumed to 
be low-rank. The FMM can be considered as a subcategory of H^-matrices [4]. The 
reader is referred to [4, 8] for a detailed overview of the various hierarchical structures. 
See also Appendix C for additional discussion of this classification. 

The use of the aforementioned hierarchical matrix representations facilitates matrix- 
vector multiplications, hence paving the way for iterative solvers such as Krylov sub¬ 
space methods for solving the linear system Ax = b [49]. The precarious convergence 
behavior — and thus the number of iterations — is a major disadvantage of iter¬ 
ative solvers. This necessitates the incorporation of a preconditioning strategy to 
improve the latter, which is often a difficult and case-dependent problem. Typi¬ 
cal preconditioners for dense matrices include the incomplete LU factorization (with 
thresholding) [40], methods based on the sparse approximate inverse [10], geomet¬ 
ric and algebraic multigrid methods [39, 54], as well as analytical preconditioning 
techniques such as Calderon preconditioners [14, 16]. Although these techniques are 
usually well suited for the specific problem they have been designed for, their perfor¬ 
mance as robust and efficient black-box preconditioners for general problems remains 
rather limited. Another disadvantage of iterative solvers is that they do not lead to 
a factorization of A that can easily be re-used, implying that the algorithm needs to 
be restarted if the right hand side b of the system is modified. Direct solvers, on the 
other hand, are known to be more robust and are able to cope with multiple right 
hand sides, but at a high computational cost. 

Significant progress has been made in recent years regarding the development of 
fast direct solvers for hierarchical matrices, especially for the aforementioned cate¬ 
gory of HODLR and HSS matrices. For example, direct 0{N) algorithms for HSS 
matrices have been presented by, among others, Martinsson and Rokhlin [41], Chan- 
drasekaran et al. [12], Xia et al. [55, 56], and Gillman et al. [20], while fast solvers 
for HODLR matrices can be found in [3, 38]. These methods can be applied as high 
accuracy direct solvers but are often more effective if employed as preconditioners in 
an iterative solver. The framework used for HODLR and HSS matrices is appropri¬ 
ate for one-dimensional (ID) applications. For higher dimensional problems (2D or 
3D), the assumption that all off-diagonal blocks are low-rank does not hold in that 
case, that is the rank typically grows with N. For 2D problems (points distributed 
on a 2D manifold or 2D plane), the rank grows like 0{'/N) [1], hence jeopardizing 
the computational efficiency of the algorithms. It is worth noting that the approach 
presented in [41] has recently been extended by Corona et al. [15] to obtain a direct 
solver for HSS matrices that scales even for 2D problems as 0{N). This improvement 


^ There is a confusing point of terminology associated with strong and weak, which cannot really 
be removed. If we use a strong admissibility criterion, we are considering that all well-separated 
blocks are low-rank. This therefore must correspond to weak hierarchical matrices or w'R (e.g., FMM 
matrices). Similarly, if we use a weak admissibility criterion, we get strong hierarchical matrices or 
sH (e.g., HODLR and HSS). This terminology is imposed by the fact that we must have sM. C wH. 
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in complexity is achieved by introducing an additional level of compression, which 
also results in an improved memory efficiency of the algorithm. Extensions of the 
method towards surfaces in 3D are currently underway [15]. Another recent develop¬ 
ment towards fast direct solvers is the hierarchical interpolative factorization recently 
introduced by Ho and Ying [33, 34], which combines ideas from [15] and [32]. In 
order to obtain a direct solver with a linear complexity 0{N) for general 2D and 3D 
problems, the most general framework of %- and 'H^-matrices seems to be very well 
suited as well. 

In this paper, the inverse fast multipole method (IFMM) is presented as a fast 
direct solver for dense H^-matrices, where “direct” refers to the use of an approximate 
LU factorization [2]. We focus on matrices given by a smooth kernel. The solver is 
inexact, although the error can be controlled and made as small as needed. A low ac¬ 
curacy solver is hence applicable as a robust and efficient preconditioner for iterative 
schemes. The IFMM relies on two key ideas. First, the dense 'H^-matrix is converted 
to an extended sparse matrix through the introduction of auxiliary variables. Sec¬ 
ond, hll-ins^ arising during the elimination^ of the latter are compressed as low-rank 
matrices if they correspond to well-separated (i.e., non-neighboring) interactions. As 
a result, the sparsity of the extended sparse matrix is maintained throughout the 
elimination. Specifically, there is an upper bound in 0{N) on the number of non-zero 
entries, at all steps during the elimination. This results in a very efficient algorithm 
with linear complexity. To the authors’ knowledge, this is the first direct solver that 
achieves this complexity for the most general category of hierarchical matrices [2]. 
This paper only considers dense matrices, but similar ideas can be employed to de¬ 
velop a fast hierarchical solver with linear complexity for sparse matrices. Such a 
solver is described in a companion paper [44]. 

The proposed strategy shows similarities to multigrid methods as it uses a hier¬ 
archical decomposition of the domain, as well as to the incomplete LU (ILU) factor¬ 
ization and its variants as the sparsity pattern is preserved. Contrary to ILU, the 
sparsity is maintained using low-rank compression instead of thresholding. 

The outline of this paper is as follows. Section 2 introduces the IFMM and fo¬ 
cuses on the key ideas required to achieve linear complexity; special attention is paid 
to a graph representation of the method. Numerical examples are subsequently dis¬ 
cussed in Section 3 to illustrate the computational efficiency of the methodology, both 
as a direct solver and as a preconditioner in an iterative scheme. The applicability 
of the method to a Stokes flow problem is finally demonstrated in Section 4, while 
summarizing conclusions are drawn in Section 5. In Appendix C, we included a dis¬ 
cussion of theoretical connections between the method discussed in this paper and the 
algorithms developed by Hackbusch et al. [4, 8, 25, 27, 28]. 

2. The inverse fast multipole method: key ideas and algorithm. 

2.1. Extended sparsification. It is the aim to obtain a fast solver for the dense 
linear system Ax = b, where A is an H^-matrix, i.e., a matrix constructed based on 
strong admissibility criteria (well-separated clusters are assumed to be low-rank only), 
and with a nested basis for the low-rank representations. 

The first step in obtaining such a solver is to exploit the 'H^-structure of the 
matrix in order to convert the original dense system into an extended sparse sys- 


^The term fill-in is used here to denote the non-zero matrix entries. 

®The term elimination refers to removing variables from a system of equations and updating the 
coefficients of the remaining variables appropriately. 
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tern, as solving the latter is more efficient from a computational point of view. The 
sparsification procedure is based on recursively decomposing a dense matrix into a 
sparse matrix (representing the “near field” interactions) and a low-rank approxi¬ 
mation (representing the “far field” interactions), combined with the introduction of 
auxiliary variables. This is done in a multilevel fashion. Similar ideas have been 
considered for HSS matrices in [13, 32]. 

The sparsification concept is illustrated in the following paragraphs for a one¬ 
dimensional (ID) example, shown in Figure 1. A ID domain D is hierarchically 
subdivided into clusters D® (with super- and subscripts and i referring to the 
hierarchical level and the cluster index at that level, respectively) until the leaf level 
I = L is reached (with L = 3 in Figure 1). Blocks in the matrix A corresponding 
to interactions between well-separated clusters ^ and are replaced by low-rank 

approximations of rank r, i.e., ~ (see Figure 2(b)). For the exam¬ 

ple of Figure 1, the resulting matrix structure is shown in Figure 2(a). Note that the 
nested structure of the "H^-matrix is not shown in this figure. For readers familiar 
with the FMM, the matrices and V® can be identified as L2P- (or L2L-), 

M2L-, and P2M- (or M2M-) operators, respectively. 
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Fig. 1. One-dimensional domain O, hierarchically subdivided into clusters 
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Fig. 2. (a) Approximation of the dense matrix by an Id?-matrix, (b) Low rank approximation 
of an interaction between well-separated clusters: A.?) ~ . 
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The dense matrix A can consequently be decomposed at the leaf level as follows: 
(1) A = -P 

The matrix in Eq. (1) is a sparse matrix containing the interactions of leaf clusters 
with their neighbors (including the self-interactions). The full rank matrix blocks can 






































THE INVERSE FAST MULTIPOLE METHOD 


5 


be denoted as P2P-operators if the FMM terminology is employed. In ID, is a 
tri-diagonal matrix. However, for general point distributions, this matrix is a general 
sparse matrix, with no “particular” structure. The matrix represents 

the dense fill-ins arising from far field interactions. The decomposition presented in 
Eq. (1) is pictorially depicted in Figure 3 for a ID example and in Figure 4 for a 2D 
example. 



Fig. 3. Decomposition of the 'H^-matrix A at the leaf level L (with L = 3): A = + 





Fig. 4. Quadtree decomposition of the matrix. This figure is showing the decomposition at level 
3 for a 2D test case. This shows how the sketch in Figure 3 is generalized in dimensions higher 
than 1. 


The matrix in Eq. (1) is representative for all far field interactions. It is still 
rather dense and is therefore further decomposed using a similar decomposition as for 

A; 

(2) A = -P -P U^2) j^(2)Y(2)'r^ .y(3)'r 

This decomposition at level Z = 2 is illustrated in Figure 5 and 6 for a ID and 
2D example, respectively. The sparse matrix in Eq. (2) represents near field 
interactions at level I = 2 (that have not been accounted for through while 

represents the remaining far field interactions. As no well-separated clusters exist at 
level Z = 1, the matrix can not be further decomposed. 

Eq. (2) is still a dense system, but can now be converted into an extended sparse 
system by introducing auxiliary variables at each level that contains well-separated 
clusters (i.e., Z > 2). More specifically, multipole coefficients are defined as: 

y(3) ^ v(3)"x 

y(2) ^ v(2)^y(3) 
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Fig. 6. Quadtree decomposition at level 2 for a 2D test case. Compare with Figure 5 for a ID 
distribution of points. 


This allows rewriting the equation Ax = b as: 

(5) S(3)x + U(3) (^S(2)y(3) + U(2)K(2)y(2)^ = b 

Furthermore, local coefficients are identified at each level: 

(6) z(") = K(2)y(2) 

(7) z(3) =S(2)y(3) +U(2)z(2) 

The original equation Ax = b thus becomes: 

(8) S(3)x + U(3)z(3) =b 

Eq. (3)-(8) provide an extended sparse system of equations: 
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The extended sparse matrix in Eq. (9) (indicated as E in the following) contains 
matrices from the decompositions introduced in Eq. (1) and (2) and exhibits a sym¬ 
metric pattern in terms of fill-in, as is shown in Figure 7 and 8. In fact, for a symmetric 
matrix A, the matrix E is symmetric as well. This is due to the specific ordering of 
the equations. The elimination and substitution processes are thus performed on this 
sparse system rather than on the original dense system. 

This sparsification procedure is a very general and powerful technique that is not 
restricted to ID or 2D problems. In general, any dense "H^-matrix can be recursively 
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Fig. 7. A pictorial representation of the extended sparse system of Eg, (9) for a ID distribution 
of points. Red blocks are P2P-operators, cyan blocks are L2P-, P2M-, L2L-, and M2M-operators, and 
orange blocks are M2L-operators. The black blocks represent negative identity matrices. The original 
unknowns x are shown in blue, the locals in light green, and the multipoles in magenta. 
The right hand side vector b is depicted in gray. 



Fig. 8. Extended sparse matrix for a 2D distribution of points. Compare with the sketch shown 
in Figure 7 for a ID distribution of points. 


decomposed into sparse matrices and low-rank approximations to obtain an extended 
sparsified matrix. 

In order to further illustrate the sparsification procedure, it is instructive to con¬ 
sider graph representations of the linear systems under concern. This will be partic¬ 
ularly helpful for the interpretation of the second key step of the algorithm, as will 
be elucidated in Subsection 2.2. Before discussing the graph of Eq. (9), the idea of 
representing a linear system as a graph is briefly recalled here by means of a small 
example. Consider the following generic 3x3 block system (this is not related to any 
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of the systems discussed earlier): 



A graph representation of Eq. (10), consisting of a set of vertices and edges, is depicted 
in Figure 9(a). Each vertex in the graph is associated with a variable x^, while the 
set of incoming edges into that vertex defines an equation that corresponds to row i 
of Eq. (10). The edges correspond to the matrix subblocks and represent the 
contribution of variable Xj to the equation of row i. The arrows indicate the direction 
of each edge. The effect of eliminating a variable from Eq. (10) on the corresponding 
graph is illustrated in Figure 9(b), highlighting how the edges are updated accordingly. 

It is worth noting that the columns of the matrix have a clear meaning (i.e., they 
correspond to the unknowns), while the meaning of the rows is less clear as they 
correspond to the equations (and these can easily be permuted). Correspondingly in 
the graph, the tail of an arrow has a clear interpretation as it is associated with an 
unknown, while the head is somewhat arbitrary as it corresponds to an equation (and 
thus a specific ordering of the equations). 


All 



Fig. 9. The graph of Eq. (10) (a) before and (b) after the elimination of the variable xi. 

Coming back to the problem under concern. Figure 10(a) and 10(b) show the 
graphs of the original dense system and the extended sparse system corresponding to 
the ID problem introduced in Figure 1, respectively. Note that in- and outgoing edges 
have been combined into a single edge (with two arrows) in order to limit the number 
of edges in these figures. The layout of the graph is dictated by the specific ordering 
of equations that we have chosen in Eq. (9), leading to a symmetric extended sparse 
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matrix. The same color convention as in Figure 7 is employed for the vertices and 
edges. The transformation from Figure 10(a) to 10(b) clearly demonstrates how one 
can take advantage of the 'H^-structure of the matrix to obtain a sparsified system. 
Figure 10(b) also illustrates that the variables at level I can be considered as 
unknowns at the parent level I — 1, hence revealing the hierarchical structure 

of the 'H^-matrix (with corresponding to x). 




Fig. 10. The graph of (a) the original dense system and (b) the extended sparse system, 
corresponding to the ID problem of Figure 1. The same color convention as in Figure 7 is adopted: 
red edges in between x correspond to cyan edges from to x correspond to cyan edges 

from X to correspond to , black edges from to z^^^ correspond to —I, black edges from 

z^^^ to correspond to —I, orange edges in between y^^^ correspond to cyan edges from 

z^^^ to y^^^ correspond to cyan edges from y^^^ to z^^^ correspond to , black edges from 

y^^^ to z^^^ correspond to —I, black edges from z^^^ to y^^^ correspond to —I, and orange edges in 
between y^^^ correspond to 


2.2. Compression and redirection of fill-ins. Transforming a dense l-L^- 
matrix into an extended sparse matrix is a crucial but insufficient step to obtain a fast 
solver. Indeed, feeding a matrix such as the ones depicted in Figure 7 and Figure 8 to 
a conventional sparse solver leads to the creation of dense fill-ins throughout the elim¬ 
ination phase (or, equivalently, additional edges in the graph of Figure 10(b)), which 
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is harmful for the scaling of the algorithm (see for example the thesis of Pals [43]). In 
order to preserve the initial sparsity pattern, a second key idea is therefore introduced: 
fill-ins corresponding to interactions between well-separated clusters are compressed 
into a low-rank representation and are subsequently redirected through other opera¬ 
tors/edges. This procedure is discussed step by step in the next paragraphs, focusing 
on a graph representation of the methodology. To this end, the one-dimensional ex¬ 
ample introduced in Subsection 2.1 is abandoned and a more general case is now 
considered. References to the original FMM-terminology (e.g., particles, multipoles, 
and locals) are omitted to stress the generality of the method. 

The following simplifying notations are introduced (see Figure 11): nodes in a 
graph corresponding to the variables xf^, z®, and yf^ of a cluster are referred 
to as nodes and respectively (the position of the bar is a mnemonic 

corresponding to the location of the node in the tree, either below or above). The 
symbol ' represents the variable Zp of its parent cluster flp . The latter can 
also be denoted as If no confusion is possible, the superscripts referring to the 

hierarchical level of a cluster will be omitted in the following for sake of readability. 
With this notation in place, operators such as the inter- and anterpolation^ matrices 
Ui and YJ of a cluster Qi are written as U(i, i) and V'^(i, i), respectively. 

it® = p(/-i) 

i(0 
i(0 



Fig. 11. Alternative notation for the nodes related to cluster 

Figure 12(a) shows part of a graph of an extended sparse matrix at a certain stage 
of the elimination phase of level I, i.e., after elimination of levels L, L — 1, ..., Z-l-1. 
This graph shows a “generic” description of the process of removing hll-in. The nodes 
shown are generic nodes in the graph of the matrix. 

The variables and of some clusters at level I have already been eliminated 
at this point, e.g., x® and of cluster (or, equivalently, the nodes h and h). 
This has created edges in the graph between neighboring clusters, e.g., connections 
(h,i) and (i,h) between and xf^ (and vice versa). Additional edges such as 
(h,h) (i.e., between and itself) have been established as well. The variables x® 
and zf^ of cluster that are to be eliminated next are highlighted in Figure 12(a). 
Performing this elimination results in new interactions between all neighbors of 


“^Interpolation = from coarse to fine. Anterpolation = from fine to coarse. 
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(including itself), as is illustrated in Figure 12(b)®. Some of these fill-ins establish 
direct connections between clusters that did not exist prior to this elimination step, 
however. For example, fill-ins E'(k, j) and E'(j,k) between fl® and arise (where 
E' denotes a fill-in in the extended sparse matrix). As the aforementioned clusters 
are well-separated (i.e., their existing interactions are accounted for through the edges 
between k and j rather than between k and j), these fill-ins are expected to be low- 
rank and can be approximated as follows: 

(11) E' (k, j) =. U' (k, k) K' (k, j) V'" (j, j) 

(12) E'(j,k) U'(j, j)K'(j,k) V'"(k,k) 

It is assumed in the following that Eq. (11) and (12) represent partial singular value 
decompositions (SVD), implying that the matrices U' and V' are orthonormal and 
that K' is a diagonal matrix containing the singular values. The matrices K'(k, j) 
and K'(j,k) are therefore also denoted as S'(k, j) and S'(j,k), respectively. The 
implications for the algorithm if this is not the case will be discussed in Subsection 2.4. 
It is also assumed that a criterion is available for determining an appropriate rank for 
approximations (11) and (12) (see Subsection 2.4 for more details). 

These low-rank approximations for the fill-in enable the calculation of new inter- 
and anterpolation operators for and respectively. For example, a new in¬ 
terpolation operator U(j, j) for is obtained by concatenating the existing scaled 
interpolation operator U(j,j)Su (where Su contains the weights associated with 
each column of U(j, j))® and the scaled fill-in operator U'(j, j) S'(j, k) and subse¬ 
quently applying a low-rank recompression (thin SVD): 


(13) 


U(j,j)Su|U'(j,j)S'(j,k) 


U(j,j)Su$, 


= [(/)j I represents a fat, row orthogonal, matrix. It is important to note that 
the recompression in Eq. (13) is applied to the scaled interpolation operators rather 
than to U(j, j) I U'(j, j) , as this allows to account for the correct “weight” of each 
vector when obtaining a new basis.^ Eq. (13) allows expressing the existing and fill-in 
operators U(j, j) and U'(j, j) in terms of the new basis U(j, j): 


(14) U( j, j) :^ U(j, j) Su Su' = U(j, j) r, 

(15) U'(j, j) U(j, j) Su S'(j,k)-i = U(j, j) r' 

The weights Su of the new interpolation basis U(j, j) are stored, as they are needed 

in potential future recompressions. A new anterpolation operator V(j, j) for is 
obtained in a similar way: 


(16) 


V(j,j)Sv|V'(j,j)S'(k,j) 


V(j,j)Sv’3>, 


^Eliminating a node also implies that the right hand side of the system needs to be updated. 
^Initial weights Su U(j, j) (i.e., upon initialization of the IFMM operators) can be obtained 

rigorously by applying an SVD to ■ |U(j, j) K(j, q) V^(q, q) | ■ ■ ■ j G with the 

interaction list of or can be roughly estimated by sampling some matrix entries. The latter 

approach is less expensive in terms of computational cost. 

^The proof of this result, which is elementary, is omitted. 









12 


PIETER COULIER, HADI POURANSARI, AND ERIC DARVE 


with I i/j'], such that V(j, j) and V'(j, j) can be written as: 

(17) V(j, j) =. V(j, j) Sv Sv' = V(j, j) t, 

(18) V'(j, j) V(j, j) Sv S'(k, j)-i = V(j, j) t' 

The same strategy as outlined in Eq. (13) to (18) is applicable for obtaining new 
operators U(k, k) and V(k, k) for . 

The crucial benefit of compressing the dense fill-ins and updating the inter- and 
anterpolation operators of the corresponding clusters is that the former can now be 
accounted for through the operators K(k, j) and K(j,k) between and For 
example, the fill-in E'(k, j) is redirected through the K(k, j)-operator, which becomes: 

(19) K(k,j) ^rfcK(k,j)tJ -f rfeK'(k, j)tf 

Updating the inter- and anterpolation operators also implies that all the other edges 
in the graph associated with need to be adjusted. More specifically, the remaining 
operators K(q, j) and K(j,q) (with G are replaced by: 

(20) K(q,j)^K(q,j)tJ 

(21) K(j,q)^r,K(j,q) 

where and represent the neighbor list (including itself) and interaction list 

(i.e., well-separated clusters that are children of its parent’s neighbors) of re¬ 
spectively. Furthermore, the inter- and anterpolation operators of the parent cluster 
of are updated as follows: 

(22) U(j,jt)^r,U(j,j^) 

(23) V(jt,j)^t,V(jt,j) 

Similar updates as in Eq. (19)-(23) are required for all edges associated with the 
cluster Edges that need to be updated are highlighted in yellow in Figure 12(c). 

The technique outlined above leads to a significant reduction of the amount of 
fill-in and ensures that the sparsity pattern of the extended sparse matrix is preserved 
throughout the elimination phase. This leads to an algorithm with linear complexity 
0{N). This assumes that the rank remains bounded independent of the matrix size. 
This is expected to be the case, as only interactions between well-separated clusters 
are replaced by low-rank approximations. 

This strategy of compressing and redirecting dense fill-ins is very similar to the 
initial sparsification approach introduced in Subsection 2.1, as low-rank interactions 
are represented at the parent level. Apart from fill-ins between the x-variables (e.g., 
between x® and x^*^), fill-ins between the variables x*^^) and are created as well 
throughout the elimination phase. If these occur between well-separated clusters 
(for example E'(h, j), E'(j,h) and E'(h,k), E'(k,h) in Figure 12(b)), the fill-ins are 
compressed and redirected through other operators at the parent level. 

Once all unknowns x^*) and variables at level I have been eliminated, the next 
level is considered. From a conceptual point of view, variables can be interpreted 
as unknowns of the parent level I — 1. More specifically, a set of variables yf^ 
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X(/-1) = y(0 

z(0 


x(0 


(a) 



z^^ 1) 


x(i~l) = y(/) 
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(b) 



z('-i) 


X(/-1) = y(0 
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x(0 


(c) 


T T T 



Fig. 12. Subgraph (a) before and (b) after the elimination of the variables and New 
and modified edges are highlighted in yellow in (b). Operators associated with edges between well- 
separated clusters (indicated by dashed yellow lines in (b)) are compressed and redirected, resulting in 
the subgraph shown in (c). The edges that are updated after compression are highlighted in yellow in 
(c). This strategy of compressing and redirecting dense fill-ins is similar to the initial sparsification 
approach introduced in Subsection 2.1, as low-rank interactions are represented at the parent level. 
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is combined to form Xp , where the clusters are children of the parent cluster 


n 


d-i). 


(24) xd-i) = {...yf) ..y 

or, equivalently, the nodes are joined to form 


(25) = y 

Cp in Eq. (24) and (25) denotes the children list of flp This merging proce¬ 
dure is illustrated in Figure 13. Furthermore, the corresponding operators are com¬ 
bined accordingly to form the inter- and anterpolation operators U(p^*“^\ p^*“^^) and 
v(p('-i),p('-i)) of 


(26) U(p('-i),p('-i)) = 


(27) V(p('-i),p('-i)) = 


U(iW^it(0) 


V(itW),i(0) 






A similar operation is performed to merge the interactions between variables into 
interactions between unknowns For example, the interaction between Xp 

and xy of clusters and flq is assembled as follows: 


(28) 


K(p('-l),q('-l)) = 


K(i(0j(0) 




The resulting graph has exactly the same structure as the initial one, except for 
the fact that it consists of one level less. The elimination and merging procedure 
are repeated consecutively for each level I > 2 until only a small dense system for 
(or, equivalently, x*^^)) remains to be solved. Once this is done, the elimination 
phase is terminated and all the other variables can be retrieved through substitution, 
marching down the graph in the reverse order. In the end, the solution x is obtained. 

2.3. The algorithm. Based on the key ideas introduced in Subsection 2.1 and 
2.2, the general IFMM algorithm can now be presented in a formal way. In the 
following, it is assumed that the "H^-structure of the matrix A is based on a three- 
dimensional octree decomposition, implying that every cluster can have up to 
3^ = 27 clusters in its neighbor list (including itself) and up to 6^ — 3^ = 189 
clusters in its interaction list . 
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Fig. 13. Subgraph (a) before and (b) after merging the variables of level I into unknowns 
x{*—1) of the parent level I — 1. The associated operators are combined accordingly. 


Algorithm 1 presents the general outline of the IFMM. It consists of initializ¬ 
ing the IFMM-operators, performing an upward pass to eliminate each level I (and 
forming the appropriate operators of the parent level I — 1), solving for 
at the top level, and substituting through a downward pass to finally retrieve the 
solution X. The elimination procedure at each level I is detailed in Algorithm 2. 
Eliminating a cluster leads to new interactions between its neighbors and 
n® (with n® S j) and E'(j,k) if both and have already 

been eliminated, E'(k, j) and E'(j,k) if has been eliminated but has not 
(and vice versa), and E'(k, j) and E'(j,k) if neither nor 11 has been eliminated 

yet. If and are well-separated, these fill-ins are compressed into a low-rank 
approximation and subsequently redirected through other operators, as discussed in 
Subsection 2.2. This redirection procedure is described in Algorithm 3. If this is not 
the case, the additional interactions are added to the existing ones. The substitution 
phase proceeds level per level in the reverse elimination order and allows determining 

and x^^) based on y^*). If I is not the leaf level, the variables x^*) are nothing 
else than the variables y(*+i) of the child level, cf., Eq. (24). At the leaf level, x^^^ 
corresponds to the desired solution x. As the elimination phase is decoupled from 
the substitution phase (as in a standard direct solve), the algorithm is well suited to 
handle multiple right hand sides b. 

The algorithm, as described in this Subsection, is for a general matrix A; it is 
not specialized for the symmetric and positive definite case. It is possible to take 
advantage of the symmetry and positive-definiteness of a matrix to reduce the com¬ 
putational cost with almost a factor of two (similar to the Cholesky factorization 
that is almost twice as efficient as the LU factorization). This does not alter the 




















16 


PIETER COULIER, HADI POURANSARI, AND ERIC DARVE 


fundamental ideas of the proposed algorithm. 


Algorithm 1 The inverse fast multipole method: overall algorithm 
1: Construct an octree decomposition of the domain 
2: for ^ = L, L — 1,..., 2 do 
3: Initialize the IFMM operators at level I 

4; end for 

5: for I = L, L — 1,... ,2 do 

6: Eliminate level I (Algorithm 2) and update the right hand side 

7: if ^ > 2 then 

8: Form the appropriate operators of the parent level I — 1 

9: end if 

10: end for 
11: Solve for (= 

12: for ^ = 2,..., L — 1, L do 

13: Substitute at level I for and 

14: ii I < L then 

15: Form based onx« (Eq. (24)) 

16: else 

17: Set X = X^-^) 

18: end if 

19: end for 


2.4. Some important remarks. Several techniques can be employed for ob¬ 
taining a suitable low-rank approximation for the dense fill-ins E' arising throughout 
the elimination phase (cf., Eq. (11) and (12)). The best possible approximation of 
rank A: of a matrix (minimizing the error norm) is given by its partial singular 
value decomposition (SVD) [21]. Unfortunately, the calculation of a complete SVD 
requires O (min m^n)) arithmetical operations for a (m x n) matrix, making 

it rather unattractive from a computational point of view. The randomized SVD 
(rSVD) algorithm [31] provides a very efficient alternative for computing (approxima¬ 
tions of) the first few singular values and vectors. This method exploits the power of 
randomization to identify a subspace that captures most of the action of a matrix [31]. 
It involves generating random matrices, performing matrix-matrix products, applying 
orthonormalizations, and computing the SVD of a much smaller matrix. Although 
it uses random matrices, the rSVD algorithm gives very accurate results with a high 
probability, provided that the singular values of the matrix decay sufficiently fast. 
Rather than specifying a fixed rank for the low-rank approximation of a dense fill-in, 
the rank can also be determined by prescribing a relative accuracy e. The relative 
importance of a fill-in is hence evaluated by weighting its singular values (obtained, 
for example, with the rSVD algorithm) with respect to some reference value. In this 
paper, it is proposed to employ the largest singular value cro(E) of the extended sparse 
matrix E for this purpose. The retained rank of a fill-in E' is thus given by®: 

(29) k{e) = min{/c G N : crfc+i(E') < ecro(E)} 

CTo (E) can easily be estimated by application of the rSVD algorithm to the extended 
sparse matrix E. 


is also bounded by the size of E'. 
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Algorithm 2 The inverse fast multipole method: elimination of level 1 . In order to 
simplify the notations as much as possible, we made the following choices. The letter 
E denotes the sparse matrix on which we are doing the Gaussian elimination. The 
fill-in introduced is denoted by E'. This fill-in is progressively compressed (if well- 
separated) and folded into existing edges (non-zero entries) of E. This compression 
process is done using matrices denoted by U', V', and K'. U' is an interpolation 
operator (i to ±), while V' is an anterpolation operator (i to i). The entries denoted 
by K' correspond to edges between nodes at the same level. 

1: for all clusters at level I do 
2 : Eliminate i and i of flf ^ 

3; for all unique pairs with do 

4; if j and j of have been eliminated then 

5: if k and k of have been eliminated then 

6: Compute the hll-ins E'(k, j) and E'(j,k) 

7: Add the fill-ins: E(k, j) ^ E(k, j) -f E'(k, j) and E( j, k) ^ E(j, k) -f E'(j, k) 

8: else 

9: Compute the fill-ins E'(k, j) and E'(j,k) 

10: if n® ^ then 

11: Compress E'(k, j) ~ U'(k, k) K'(k, j) and E'(j,k) ~ K'(j,k) V'^(k,k) 

12: Update the edges associated with (Algorithm 3) 

13: else 

14: Add the fill-ins: E(k, j) •(— E(k, j)-l-E'(k, j) and E(j,k) •(— E(j,k)-|-E'(j,k) 

15: end if 

16: end if 

17: else 

18: if k and k of have been eliminated then 

19: Compute the fill-ins: E'(k, j) and E'(j,k) 

20: if ^ then 

21 : Compress E'(k, j) ~ K'(k, j) V'"(j, j) and E'(j,k) ~ U'(j, j) K'(j,k) 

22: Update the edges associated with (Algorithm 3) 

23: else 

24: Add the fill-ins: E(k, j) ^ E(k, j)-|-E'(k, j) andE(j,k) ^ E(j,k)-|-E'(j,k) 

25: end if 

26: else 

27: Compute the fill-ins E'(k, j) and E'(j,k) 

28: if ^ then 

29: Compress E'(k,j) ~ U'(k,k) K'(k, j) V'"(j, j) and E'(j,k) ~ 

U'(j,j)K'(j,k)V'"(k,k) 

30: Update the edges associated with (Algorithm 3) 

31: Update the edges associated with fl® (Algorithm 3) 

32: else 

33: Add the fill-ins: E(k, j) v- E(k, j)-|-E'(k, j) andE(j,k) •«— E(j,k)-|-E'(j,k) 

34: end if 

35: end if 

36: end if 

37: end for 
38: end for 
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Algorithm 3 The inverse fast multipole method: updating of the edges associated 
with after compressing the interactions between and 

1 

U(j,j)^ 

U(j_, j), Su f- Su 

2 

V(j,j)^ 

j)i f- Sv 

3 

U(j,jt)^r,U(j,jt) 

4 

V(jnj)^t,V(jtj) 

5 

for all G do 

6 

K(q,j) 

^K(q,j)tJ 

7 

K(j,q) 

^ OK(j,q) 

8 

end for 

of have been eliminated then 

9 

if k and k 

10 

K(k,j) 

^K(k,j)tJ + K'(k,j)tf 

11 

K(j,k) 

^ OK(j,k) -hr'K'(j,k) 

12 

else 


13 

K(k,j) 

^rfeK(k,j)tJ + r',K'(k,j)tf 

14 

end if 



A low-rank recompression is also required for calculating new inter- and anterpo- 
lation operators of a cluster after compression of the fill-ins (cf., Eq. (13) and (16)). 
Once again, computing the SVD will provide the optimal recompression, but at a rela¬ 
tively high computational cost. Applying the rSVD algorithm will not be very efficient 
either in this case, as the rank of the new operators U(j, j) and V(j, j) is at least as 
large as the rank of the original operators U( j, j ) and V( j, j ) (as the latter consist of 
orthogonal columns). This recompression hence differs from the compression of the 
fill-ins, where only the first few singular values are (expected to be) of importance. 
The fully pivoted adaptive cross approximation (ACA) algorithm [5, 46]®, followed by 
an SVD recompression (i.e., two skinny QR decompositions and an SVD performed 
on the product of the resulting triangular matrices), provides a more efficient way for 
performing the recompression and is therefore favored. 

A final remark is related to the weighting of the basis vectors when computing new 
inter- and anterpolation operators, as presented in Eq. (13) and (16). The advantage 
of using rSVD and ACA-I-SVD is that the obtained singular values straightforwardly 
provide the weights for each vector in the basis. If the low-rank approximations in 
Eq. (11), (12), (13) and (16) do not correspond to singular value decompositions (e.g., 
when using a rank-revealing LU decomposition), the weighting needs to be chosen in a 
different manner. The pivots of the LU factorization could, for example, be employed 
for this purpose. Although this approach is less accurate than using the singular 
values, it can still provide a reasonable measure of the scale of each basis vector when 
performing the recompression. 

2.5. Computational cost. We will derive the computational cost in the case 
of a uniform tree decomposition of the matrix (as opposed to adaptive), although the 
proof extends to general adaptive trees. 

Let’s assume that we have a matrix A in H®-format. Given a tolerance e, we 
assume that there is an upper bound r on the rank of all compressed blocks such that 
r G ©(log 1/e). One can prove that for the original matrix A, in the case of smooth 


®This approach is the same as an LU factorization with full pivoting. 
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kernels (e.g., Green’s functions), the rank r of the compressed blocks does satisfy this 
property [19]. However, in our case, as we proceed through the LU factorization, 
recompression is required. We have not proved that the rank does not increase in an 
unbounded fashion (for a fixed e, as N varies) as a result of this process. However, 
we did observe experimentally that the rank is in fact bounded. See in particular 
Figure 17. 

Under these assumptions, the cost of operating on a node at a given level is 0{r^). 
The cost of a leaf node elimination is 0{Nq), where Nq is an upper bound on the size 
of all leaf clusters. Choose Nq G H(r). This requirement sets the number of levels in 
the tree, and the number of leaf nodes. In that case, the cost of the elimination of 
all leaf nodes is 0{Nr‘^). The cost of compression of the far-away fill-in blocks, and 
computation of the updated parent-level matrix blocks is also 0{Nr^). Since at the 
next level, the number of clusters is divided at least by a fixed factor (e.g., 2 for a 
binary tree or 8 for an octree), the total computational cost of the method is 0{Nr^) 
or C>(A^log(l/e)^). 

3. Numerical examples. The IFMM algorithm has been implemented in C++, 
using the Eigen-library [23] for performing dense linear algebra operations. In the fol¬ 
lowing subsections, numerical examples are considered to validate the implementation 
and to demonstrate the efficiency of the methodology. All computations have been 
performed on Intel® Xeon® E5-2650 v2 (2.60 GHz) CPUs. In each of these examples, 
the initial low-rank approximations in the H^-matrix are created using interpolation 
based on Chebyshev polynomials [19], followed by an additional SVD to reduce the 
rank. A uniform octree decomposition of the domain is employed; the number of lev¬ 
els is adjusted to ensure that the leaf clusters contain approximately 100 points. This 
value was found to provide a reasonable trade-off between the time spent at the leaf 
level and the time spent at higher levels in the tree.^° A relative accuracy £ = 10“^ 
is used, both for compressing the fill-ins (with rSVD) and for updating the inter- and 
anterpolation operators (with ACA). 

3.1. The IFMM as a direct solver. Consider the linear system of equations 
Ax = b with matrix entries = Ar(jjri — r^]]) depending on the kernel K{r): 

{ 1 if r = 0, 

^ if 0 < r < d, 

7 if r>d. 

with Vi and Vj indicating 3D position vectors and r = [Jr^ — r^-jj. Figure 14 shows the 
kernel function K{r) for three values of the parameter d in Eq. (30). This parameter 
strongly affects the condition number k(A) of the matrix A: the larger d, the larger 
the condition number (for a fixed matrix size N). The condition number also increases 
with N for a hxed value of d, as illustrated in Figure 15. Note that the matrix entries 
are directly defined by the kernel, rather than by a boundary integral formulation 
that involves integration of the kernel. Instead of using a kernel K{r) = 1/r, the 
kernel is modified near r = 0 according to Eq. (30) in order to ensure that bounded 
matrix entries are obtained on/near the diagonal of A. 

In the following, a known vector x is chosen and the matrix-vector product Ax 
is evaluated for obtaining b. The IFMM is subsequently used as a direct solver, with 

^^Less points per leaf cluster makes the elimination of the leaves faster, but increases the number 
of levels and thus the time spent in the tree, and vice versa. 
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Fig. 14. Kernel function K(r) for three values of the parameter d. 




Fig. 15. Condition number k(A) as a function of the matrix size N, for points randomly 
distributed (a) on the unit sphere and (b) in the cube [—1,1]^. 


the aforementioned vector b as right hand side. The retrieved solution is denoted 
as X. 

In a first example, N points r are randomly distributed on the unit sphere. The 
parameter d is scaled according to d = 10“^ (-/V/1000)“^/^ in order to ensure that 
the condition number of A remains nearly constant for all N (see Figure 15(a)). 
Figure 16(a) shows the total time taken by the IFMM solver as a function of the 
number of points N, with the number of Chebyshev nodes n (in each dimension) 
varying from 1 to 4 (the total number of Chebyshev interpolation nodes employed 
for constructing the initial low-rank approximations hence equals n^). These curves 
demonstrate that, as expected, a linear complexity 0{N) is achieved. The computa¬ 
tion time increases for larger values of n. The relative error ||x — x|| 2 /||x ||2 for each 
of these cases is depicted in Figure 16(b). The figure clearly shows a decrease of the 
error as n is increased. The errors are reasonably small for all problem sizes N. 

A crucial feature for achieving the linear complexity demonstrated in Figure 16(a) 
is the fact that the rank of the matrix blocks corresponding to far held interactions 
remains (almost) constant if the matrix size N is increased (due to the strong admissi¬ 
bility criterion employed for 'H^-matrices). This is in contrast with HODLR and HSS 
matrices, where the rank grows like 0{'/N) for 2D problems [1]. Figure 17 shows the 
maximum and average rank of the aforementioned matrix blocks as a function of N 
for the case of n = 4, conhrming the previous statement (especially for N > 10^). 
The kinks in the curve of the average rank are caused by varying the number of levels 
in the octree. 

Next, a more general example of N points r randomly distributed in the cube 
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Fig. 16. (a) Total CPU time and (b) the relative error ||x — x|| 2 /||x ||2 o,s a function of the 
matrix size N, for points randomly distributed on the unit sphere. 



Fig. 17. Maximum (black line) and average (grey line) rank of the matrix blocks corresponding 
to far field interactions as a function of the matrix size N, for points randomly distributed on the 
unit sphere (with n = A). 


[—1,1]^ is investigated. The parameter d varies in this case as d = 10“^ (A^/1000)“^/^ 
to keep the condition number of A nearly constant for all N (see Figure 15(b)). 
Figure 18(a) confirms that a linear scaling of the computation time is also obtained 
in this case (although only from N ^ 10^ upwards), while Figure 18(b) demonstrates 
the convergence of the relative error as the number of Chebyshev nodes is increased. 
The computation times are slightly higher than in the case of points distributed on 
a sphere (Figure 16(a)). This is due to the fact that a uniform octree decomposition 
of a sphere leads to many empty clusters, implying that most of the clusters will 
have less than 27 neighbors, while it is likely that almost all clusters will possess the 
maximum number of neighbors in case the points are randomly distributed in a cube. 
The reader is referred to Appendix A for a detailed breakdown of the computational 
cost for this example. 

Depending on the application of interest, the errors shown in Figure 16(b) and 18(b) 
might already be sufficiently small. The convergence of the IFMM is investigated in 
more detail in Figure 19, which shows the relative error ||x — x|| 2 /||x ||2 as a function 
of the number of Chebyshev nodes for a fixed matrix size N = 8 x 10^. An increase 
of n leads to more accurate low-rank approximations of the far field interactions and 
consequently a smaller overall error in x. This figure demonstrates that the method 
is capable of serving as a highly accurate direct solver. The number of Chebyshev 
nodes needs to be increased significantly if 10 or more digits of accuracy are desired, 
however, which leads to a strong increase of the computational cost (even if the lin- 
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Fig. 18. (a) Total CPU time and (b) the relative error ||x — 
matrix size N, for points randomly distributed in the cube [—1,1]®. 


as a function of the 


ear complexity is preserved). In most cases (depending on the number of right-hand 
sides), it is more efficient to apply a low accuracy IFMM solver as a preconditioner in 
an iterative method rather than using a highly accurate direct IFMM solver, as will 
be illustrated in the next subsection. 



Fig. 19. The relative error ||x — x|| 2 /||x ||2 as a function of the number of Chebyshev nodes n, 
for N = 8000 points randomly distributed in the cube [—1,1]®. 

3.2. The IFMM as a preconditioner in an iterative scheme. In this sub¬ 
section, the linear system Ax = b is solved iteratively using a non-restarted GMRES- 
algorithm [50, 51] with a tolerance e = 10“^° for the relative residual jjb —Ax|| 2 /||b ||2 
and a maximum of 500 iterations. The same kernel K(r) as in Eq. (30) is used. The 
black-box FMM [19] is employed for evaluating the matrix-vector products within 
GMRES, while a low accuracy IFMM solver is applied as right preconditioner in 
each iterative step. A comparison of right and left preconditioning is provided in 
Appendix B. 

The first example involves N = 10^ points randomly distributed in the cube 
[—1,1]^, with d = 10“^ (in Eq. (30)). Figure 20(a) shows the relative residual as a 
function of the number of GMRES-iterations for several preconditioning approaches. 
Without preconditioning, convergence is rather slow and a total of 78 iterations are 
required to achieve convergence. The application of the IFMM as preconditioner re¬ 
sults in a much faster convergence. Even with a rank-1 approximation of the far 
field interactions (i.e., n = 1), a significant reduction of the number of iterations is 
obtained. An increase of the number of Chebyshev nodes leads to a more accurate 
preconditioner and hence a further reduction of the number of iterations; this comes 
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at the price of a higher cost to construct the preconditioner, however. The computa¬ 
tion times for the considered preconditioning strategies are presented in Figure 20(b), 
where a decomposition is made into the time required to construct the preconditioner 
(i.e., the elimination phase in the IFMM) and the actual iteration time (i.e., the sub¬ 
stitution phase in the IFMM, together with all GMRES-related computations). For 
the example under concern, the use of three Chebyshev nodes per dimension leads to 
the smallest overall computation time. The results are summarized in Table 1. 



Fig. 20 . (a) Relative residual ||b — Ax||2/||b||2 as a function of the iteration step and (b) total 
CPU time for d = 10 “®. 


Table 1 

Overview of the results for various preconditioning strategies for d = 10“® 


# iterations Total CPU time [s] ||b — Ax||2/||b||2 ||x — x||2/||x||2 


No precon 

78 

509 

9.1 

X 

10-11 

4.6 

X 

10-10 

IFMM {n = 

1) 

19 

213 

6.1 

X 

10-11 

4.0 

X 

10-11 

IFMM (n = 

2) 

6 

119 

1.5 

X 

10-11 

9.6 

X 

10-12 

IFMM {n = 

3) 

3 

118 

5.2 

X 

10-11 

2.9 

X 

10-11 

IFMM (n = 

4) 

2 

265 

2.7 

X 

10-11 

2.1 

X 

10-11 


In the second example, the parameter d in Eq. (30) is increased from 10“^ to 
10“^, resulting in a more ill-conditioned matrix A. Figure 21(a) demonstrates that 
the unpreconditioned GMRES-algorithm is now unable to converge to the specified 
residual within 500 iterations. Application of the IFMM as preconditioner ensures 
convergence, although a rank-1 approximation of the far field is not very efficient in 
this case, as the number of iterations remains rather large. As soon as two or more 
Ghebyshev nodes are employed, a remarkable reduction of the number of iterations 
is obtained and convergence is much faster. Figure 21(b) and Table 2 illustrate that 
the use of three Chebyshev nodes per dimension results once more in the smallest 
overall computation time. It is also important to note that although the relative 
residual ||b — Ax|| 2 /||b ||2 reaches the specified tolerance e, the actual relative error 
||x — x|| 2 /||x ||2 is a few order of magnitudes larger. This is due to the large condition 
number of A. 

In order to provide more insight in the effect of the IFMM as preconditioner, the 
eigenvalue distributions of the original and preconditioned matrices A and P~^A are 
investigated, where represents the IFMM preconditioner. A smaller problem of 
size N = 10^ is considered here. It is furthermore emphasized that the ‘true’ matrix A 
is used for this purpose, and not an FMM approximation of A. Figure 22(a) and 22(b) 
show the eigenvalues of A and P“^A for d = 10“^ and d = 10“^, respectively. As A is 
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Fig. 21. (a) Relative residual ||b — Ax|| 2 /||b 
CPU time for d = 10“^. 



No precon n=1 n=2 n=3 n=4 


(b) 

1 2 CLS a function of the iteration step and (b) total 


Table 2 

Overview of the results for various preconditioning strategies for d = 10“^. The results are 
surrounded with parentheses if no convergence was achieved. 



^ iterations 

Total CPU time [s] 

||b - Ax|| 2 /||b ||2 

||x-x|| 2 /||x ||2 

No precon 

(500) 

(4290) 

(6.3 X 10“®) 

(4.2 X 10“^) 

IFMM (n = 1) 

305 

2522 

9.7 X 10“ii 

3.8 X 10“® 

IFMM (n = 2) 

61 

521 

8.3 X 10“ii 

2.2 X lO-l 

IFMM (n = 3) 

24 

301 

7.3 X 10-11 

4.6 X 10-1 

IFMM (n = 4) 

14 

414 

7.3 X 10“ii 

7.9 X 10-1 


symmetric, all eigenvalues are real. It is clear from these figures that the eigenvalues 
of A are not well clustered. This explains the slow convergence of GMRES if no 
preconditioning is employed. Application of the IFMM as preconditioner leads to a 
better clustering of the eigenvalues. An increase of the number of Chebyshev nodes 
results in a more compact clustering of the eigenvalues near 1, and consequently a 
faster convergence of the iterative solver. 
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Fig. 22. Eigenvalue distribution of the original (black crosses) and the preconditioned (colored 
crosses) matrices of size N = for (a) d = 10“^ and (b) d = 10“^. 


The results presented in this subsection clearly illustrate the effectiveness of the 
IFMM as preconditioner for an iterative solver. There is a trade-off between the 
preconditioner’s accuracy (and thus the number of iterations) and its construction 
cost, which has to be evaluated on a case-by-case basis. 

4. Application: rigid bodies immersed in a Stokes flow. In order to illus¬ 
trate the applicability of the IFMM to challenging science and engineering problems 
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(as opposed to the synthetic examples discussed in Section 3), an application involv¬ 
ing rigid bodies immersed in a Stokes flow is considered in this section. These rigid 
bodies represent, for example, colloidal particles suspended in a solvent [18]. Each 
rigid body is discretized with a number of points on the surface, and a force vector 
Fi G and velocity vector Ui G are associated with each point. The linear 

system of equations that needs to be solved reads as: 

(31) 7WF = U 

where F G R^^^^ and U G collect the forces and velocities of all N points, 

respectively. The mobility matrix A4 G r 3^^3^ consists ot N x N blocks of size 3x3. 
A subblock Mij G R^^^ corresponding to the interaction between points and rj 
can be expressed as: 

(32) = /(||r„ |l)I-f g(||ry||)f„- (g)fy 

with Tij = Ti — Tj, while a hat in Eq. (32) denotes a unit vector, /(r) and g{r) 
are scalar functions of distance. In this paper, the commonly used Rotne-Prager- 
Yamakawa (RPY) kernel [48] is employed for f{r) and g{r), leading to a symmetric 
positive-semidefinite mobility matrix M. The reader is referred to [36] for a detailed 
overview of the immersed boundary method that leads to Eq. (31). 

The dense system presented in Eq. (31) is solved with GMRES; the matrix-vector 
products are evaluated with an FMM-scheme [36]. An arbitrary velocity vector U is 
taken as right hand side. First, a regular cubic lattice consisting of 16 x 16 x 16 spher¬ 
ical rigid bodies is considered. Each sphere is discretized with 42 particles, resulting 
in a system of equations with 3 x 42 x 4096 = 516096 unknowns. Figure 23(a) shows 
the relative residual as a function of the iteration step in the GMRES-algorithm, and 
this for three different preconditioning approaches. If no preconditioning is applied, 
convergence to the specified relative residual e = 10“® is achieved after 231 iterations. 
The use of a simple block-diagonal preconditioner (where the diagonal subblocks of 
size 126 X 126 correspond to the mobility matrices of the individual spheres) already 
leads to a significant reduction of the number of iterations and hence the overall 
computation time (see Figure 23 and Table 3). This is a very cheap precondition¬ 
ing strategy (i.e., the time needed to construct the preconditioner is negligible) that 
performs very well in this particular case, as the subblocks in the block-diagonal ma¬ 
trix represent the mobilities of disjoint rigid bodies. Application of the IFMM as 
preconditioner (with n = 2 Chebyshev nodes) results in a further reduction of the 
number of iterations. The overall speed-up remains rather limited, however, as the 
preconditioner’s construction cost is considerably higher. It is nevertheless remark¬ 
able that the IFMM still results in a speed-up, given the fact that the block-diagonal 
preconditioner already performs very well. 

Table 3 

Overview of the results for various preconditioning strategies for a cubic lattice consisting of 
4096 spheres, each discretized with 42 particles. 



iterations 

Total CPU time [s] 

||U-Y(F||2/||U||2 

No precon 

231 

7900 

9.3 X lO"'^ 

Block diagonal 

78 

2662 

8.3 X lO-'^ 

IFMM (n = 2) 

29 

1928 

7.5 X lO-'^ 


The second example involves six concentric spherical shells with a discretization 
ranging from 42 particles for the smallest inner shell to 40962 particles for the largest 
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No precon Block diagonal iFMM 


(b) 


Fig. 23. (a) Relative residual ||U —AIF|| 2 /||U ||2 as a function of the iteration step and (b) total 
CPU time for a cubic lattice consisting o/4096 spheres, each discretized with 42 particles. 


outer shell; the total number of particles equals 54612. The mobility matrix A4 is 
more ill-conditioned compared to the previous example. Once again, GMRES is em¬ 
ployed for solving Eq. (31). Figure 24(a) and Table 4 indicate that 331 iterations are 
required if no preconditioner is incorporated. The use of a block-diagonal precondi¬ 
tioner (where diagonal subblocks of size 108 x 108 have been employed^^) does not 
result in a speed-up; it even deteriorates the convergence. While this was a successful 
preconditioning strategy in the previous example, such an approach is too simplistic to 
be effective in this more complicated test case, as the diagonal subblocks do not corre¬ 
spond any more to the mobility matrices of disjoint spheres. Application of the IFMM 
as preconditioner (with n = 3 Chebyshev nodes), on the other hand, leads once more 
to a very fast convergence of the relative residual and consequently a considerable 
reduction of the computation time (see Figure 24(b)). Note that this is a relatively 
small problem compared to the previous example, as only 3 x 54612 = 163836 un¬ 
knowns are considered. The relative speed-up of the IFMM is expected to be even 
more signihcant for larger problems. 




No precon Block diagonal iFMM 


(b) 


Fig. 24. (a) Relative residual ||U —A4F||2/||U||2 as a function of the iteration step and (b) total 
CPU time for six concentric spherical shells discretized with 54612 particles. 


Defining a block diagonal preconditioner in this case is rather arbitrary, as the diagonal subblocks 
do not correspond to the self-interactions of individual spheres anymore. The size of the subblocks 
was chosen as 108 x 108 to ensure that all diagonal subblocks are of equal size (3 x 54612/108 = 1517), 
while keeping the trade-off between the size of the subblocks and the computational cost of assembling 
and applying the preconditioner in mind. 
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Table 4 

Overview of the results for various preconditioning strategies for six concentric spherical shells 
discretized with 54612 particles. 



iterations 

Total CPU time [s] 

||U-MF||2/||U||2 

No precon 

331 

3988 

9.9 X lO-'^ 

Block diagonal 

432 

5202 

9.8 X 10“® 

IFMM {n = 3) 

20 

778 

8.3 X 10“® 


5. Conclusions and future work. In this paper, the inverse fast multipole 
method has been presented as a fast direct solver for linear systems of equations 
involving dense H^-matrices. The method is inexact, although an arbitrary high 
accuracy can be achieved by tuning the parameters appropriately. Because the error 
can be related to the decay of the singular values for low-rank fill-in blocks, we expect 
the error to decay roughly geometrically in many applications as a function of the 
rank. The overall computational cost scales like 0{r^N) where r is the average rank 
at the leaf level. In terms of the tolerance criterion e, the cost is 0(A^log(l/e)^). 

The algorithm is based on two key ideas. First, the H^-structure of the dense 
matrix is exploited to construct an extended sparse matrix. Second, fill-ins arising 
throughout the elimination of the latter are compressed and redirected if they cor¬ 
respond to well-separated clusters. As a result, the sparsity pattern is maintained 
(strictly speaking, there is a bound, independent of N, on the fill-in), leading to a 
fast solver that scales like 0{N) for 3D problems. Numerical examples have been 
presented to validate the linear scaling of the algorithm. It has furthermore been 
demonstrated that a low accuracy IFMM solver performs extremely well as a precon¬ 
ditioner for an iterative solver; using the IFMM as a preconditioner is often favorable, 
compared to its use as an accurate direct solver. Furthermore, an application related 
to the interaction of rigid bodies immersed in a Stokes fluid has been discussed to 
illustrate the method’s applicability to challenging science and engineering problems. 

An important aspect to consider in future work is the parallelization of the pro¬ 
posed methodology. The algorithm is expected to exhibit good parallel scalability, 
as each cluster only interacts with a limited number of other clusters (i.e., those in 
its neighbor and interaction list). Furthermore, alternative low-rank approximation 
schemes that might be more efhcient than interpolation based on Chebyshev poly¬ 
nomials can be investigated. Finally, the order of elimination of the nodes can be 
optimized (at each level of the tree) to minimize the amount of fill-in and the number 
of (re)compressions, and could potentially also improve the numerical stability of the 
algorithm. 

Although the examples presented in this paper only involve uniform octrees, the 
method can easily be applied to adaptive trees with a varying number of levels [45, 57]. 

Appendix A. Additional information - the IFMM as a direct solver. 

This appendix provides some additional data related to the benchmarks presented in 
Subsection 3.1. More specifically, the computational cost for the IFMM as a direct 
solver, as depicted in Figure 18, is analyzed in detail in the following. Figure 25 
shows a breakdown of the total CPU time into the time required for initializing the 
IFMM operators, estimating the largest singular value (To(E) of the extended sparse 
matrix E, and performing the elimination and substitution. The number of Chebyshev 
nodes (in each dimension) is varied from n = 1 to n = 4; an increase of n results in 
a higher overall computation time. It is clear from this figure that the elimination 
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phase strongly dominates the total computation time, while the contribution of the 
other components, and especially of the substitution phase, is almost negligible. 



1 2 4 8 16 32 64 128 256 512 1024 

(c) (d) 


2 


16 32 64 128 256 512 1024 

W[x 1000] 


Fig. 25. Breakdown of the total CPU time for the IFMM as a direct solver in case of N points 
randomly distributed in the cube [—1,1]®. The number of Chebyshev nodes (in each dimension) used 
for the initial low-rank approximations corresponds to (a) n = 1, (b) n = 2, (c) n = 3, and (d) n = A. 
From left to right, each bar represents: initialization, estimation ofagfE), elimination, substitution, 
and total computation time (brown bars). The elimination step (light green bars) dominates the total 
time in most cases as expected. 

A more detailed breakdown of the time spent on different types of operations 
throughout the elimination phase is provided in Figure 26. Following operations 
are distinguished: performing LU-factorizations and -solves, executing matrix-matrix 
multiplications, obtaining low-rank approximations through rSVD and ACA, and 
transferring operators to the parent level (corresponding to Eq. (26)-(28)). The com¬ 
putation of low-rank approximations is dominant in case of n = 1 and n = 2, while 
matrix-matrix multiplications are the most time consuming operations for n = 3 and 
n = 4. In all cases, the contribution of transferring operators from one level to the 
parent level is significantly less important than the other components. 

Finally, the different behavior of fill-ins arising between neighboring or well- 
separated clusters is illustrated. Figure 27 shows the singular values of some fill-ins 
E'(k, j) that are encountered throughout the elimination phase, both for neighboring 
{flj G A4) and well-separated (flj ^ A4) clusters ilj and This figure confirms 
that the decay of singular values is significantly faster in the case of well-separated 
clusters, indicating that E'(k, j) can then indeed accurately be approximated using a 
low-rank representation. 

Appendix B. Additional information - the IFMM as a preconditioner. 

The results presented in Subsection 3.2 have been obtained by applying the IFMM 
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1 2 4 8 16 32 64 128 256 512 1024 

(c) (d) 


4 8 16 32 64 128 256 512 1024 

iV[xl000] 


Fig. 26. Breakdown of the CPU time spent on different types of operations throughout the 
elimination phase of the IFMM algorithm. The number of Chebyshev nodes (in each dimension) used 
for the initial low-rank approximations corresponds to (a) n = 1, (b) n = 2, (c) n = 3, and (d) n = A. 
From left to right, each bar represents: block LU factorization and triangular solves, matrix-matrix 
products (trailing matrix update in the block LU factorization), low-rank approximations, operator 
transfer (from child to parent), and total elimination time (brown bars). For small n, the low-rank 
approximations (light green bars) dominate, while for large n, the matrix-matrix products (light blue 
bars) dominate. 



Fig. 27. 
well-separated 


Singular values ak of fill-ins E'(fc, j) arising between neighboring (black line) and 
(grey line) clusters Qj and Q/,. 


as a right preconditioner in GMRES. This appendix compares these results to a left 
preconditioning approach. 

The results of left and right preconditioning are compared in Table 5 and 6 for 
the two test cases considered in Subsection 3.2. Note that the ‘relative residual’ 
indicated in these tables corresponds to the actual residual ||b — Ax|| 2 /||b ||2 for right 
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preconditioning, while it equals ||P“^b — P”^Ax||2/||P“^b||2 for left preconditioning 
(with P“^ representing the IFMM preconditioner). 

For d = 10“^ (Table 5), the effectiveness of the IFMM as left or right precon¬ 
ditioner is similar, both in terms of the number of iterations and the relative error 
||x — x||2/||x||2. For d = 10“^ (Table 6) and n > 1, the left preconditioning ap¬ 
proach needs more iterations to converge to the specified residual, but it leads to a 
significantly smaller relative error in the solution. 

Table 5 

Overview of the results for left and right preconditioning for d = 10“®. 



iterations 

relative residual 

||x-x|| 2 /||x ||2 

n = 1 (right) 

19 

6.1 X 10"“ 

4.0 X 10-11 

n = 1 (left) 

19 

3.5 X 10"“ 

4.5 X 10-11 

n = 2 (right) 

6 

1.5 X 10"“ 

9.6 X 10-12 

n = 2 (left) 

6 

9.1 X 10-12 

9.2 X 10-12 

n = 3 (right) 

3 

5.2 X 10-11 

2.9 X 10-11 

n = 3 (left) 

3 

2.8 X 10-11 

2.8 X 10-11 

n = 4 (right) 

2 

2.7 X 10-11 

2.1 X 10-11 

n = 4 (left) 

2 

2.0 X 10-11 

2.0 X 10-11 


Table 6 

Overview of the results for left and right preconditioning for d = 10“^. 



iterations 

relative residual 

||x-x|| 2 /i|x ||2 

n = 1 (right) 

305 

9.7 X 10-11 

3.8 X 10-® 

n = 1 (left) 

303 

5.4 X 10-11 

5.2 X 10-® 

n = 2 (right) 

61 

8.3 X 10-11 

2.2 X 10-1 

n = 2 (left) 

62 

9.3 X 10-11 

5.2 X 10-® 

n = 3 (right) 

24 

7.3 X 10-11 

4.6 X 10-1 

n = 3 (left) 

27 

3.8 X 10-11 

2.0 X 10-® 

n = 4 (right) 

14 

7.3 X 10-11 

7.9 X 10-1 

n = 4 (left) 

18 

3.8 X 10-11 

1.6 X 10-® 


Appendix C. Classification and relation between different methods. We 

briefly present the different 'H-matrix formats and discuss connections between the 
algorithm presented in this paper and the fast solvers developed by Hackbusch et 
al. [4, 8, 25, 27, 28]. 

■H-matrices can be classified into four broad categories. 

1) wH with non-nested basis. Off-diagonal blocks that correspond to clusters 
that are well-separated are low-rank. The low-rank basis is non-nested, that is the 
low-rank basis of a node is unrelated to the basis of its children. This is the most 
general 'H-format. The matrix is generally speaking expanded in the form: 


In A 

A = A, + ^UzSiVf 

1=0 


where Ag is a block sparse matrix. 
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2) w'H with nested basis. The low-rank basis of a node is a linear combination of 
the basis of its children nodes. In that case, the expansion can be written in recursive 
form: 


A = A, + UA„V^ 

where A-u is a w'H matrix of smaller size. FMM matrices [11, 19], and the matrices 
discussed in this paper are of this form. This also corresponds to "H^-matrices [7, 26, 
27]. 

3) s'H with non-nested basis. All off-diagonal blocks are assumed to be low-rank. 
The matrix decomposition is: 

In A 

A = Ad + ^U,S/VF 

1=0 

where Ad is a block diagonal matrix. This is the same as the HODLR matrix for¬ 
mat [1]. 

4) sH with nested basis: 

A = Ad + UAhV^ 

where A-^ is an sTL matrix. This is the HSS format [12, 53]. 

We now discuss the connection between our algorithm and Hackbusch et al. [4, 8, 
27, 25, 28]. Both methods use a tree decomposition of the matrix in order to identify 
low-rank blocks. Although the Hackbusch algorithm was written differently, this algo¬ 
rithm can be derived using the extended sparse matrix introduced in Subsection 2.1. 

The key difference with this paper is that Hackbusch’s algorithm performs the 
LU factorization using a depth-first in-order (symmetric) traversal. In contrast, our 
method uses a breadth-first or level-by-level traversal, starting from the leaf level and 
moving up to the root. 

This can be illustrated on a simple example for concreteness. Consider the sim¬ 
plest 77-matrix: 


A = 



Uovn 

All J 


The extended sparse format is: 


VS' 

V 


All 

Vf 


Uo \ 
Ui 

-I 

-I / 


Hackbusch’s algorithm is based on a block LU factorization. Such a factorization 
is obtained if, in E, we eliminate first Aqo, then the auxiliary variables, then An. 
Step-by-step, let’s eliminate Aqo: 


/All Ui' 

-VS’Ao-o'Uo -I 

Vvf -I 
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If we eliminate pivots 2 and 3 (before An), we simply get: 

An-UiV^Ao-giUoYf 

This is the expected Schur complement from a block LU factorization. 

If we represent E using a tree, for a 3-level 'H-matrix, this elimination corresponds 
to the order shown in Figure 28. 



Fig. 28. Tree node elimination order following Hackbusch’s algorithm. The traversal is a 
depth-first in-order tree traversal. 


In contrast, our algorithm follows a level-by-level elimination. Going back to E, 
we first eliminate Aqo, then An. This leads to a new linear system involving the 
auxiliary variables only: 

/-V^AoVUo -I \ 

V -I -VfAn'uJ 

As a result, in terms of traversal ordering, this paper follows a level-by-level 
elimination. This is shown in Figure 29. 



Fig. 29. Tree node elimination order following this paper’s algorithm. The traversal is a 
level-by-level (breadth-first) traversal. 
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